
INTELLIGENT CONSTRUCTION OF APPLICATION-ORIENTED DATABASES

V.V.Kluev, S.Y.Garnaev

Saint Petersburg State University, St. Petersburg Institute of Fine Mechanics and Optics

Abstract. The paper describes a method for the automated construction of application-oriented databases on the Internet using a robot program (crawler). The operating principles of the crawler are presented, along with the procedure for tuning it to a subject area and its interaction with a human expert who selects relevant documents. Experimental results for a number of subject areas are analyzed.

1. Introduction

Nowadays there are many specialized subject servers on the Internet, and tremendous financial, human and time resources have been spent to build their thematic collections. Traditionally, specialized servers are constructed as follows: an expert manually searches for relevant information on the Internet, and this process takes a great deal of time. For example, the database for the WWW server "Physics in the Internet", supported by the Russian Basic Research Foundation (project 96-07-89149, http://www.physics.nw.ru), took three years to build and contains 3000 documents. The traditional scheme can be altered if the results of the project "Open Architecture Server for Information Search and Delivery" (OASIS) are used. The expert responsible for some part of the database forms a collection from documents already present in the database. This collection is a core of the most relevant documents on the given topic, and all of its documents are in the same language. For a new topic, the expert has to find a core of documents on the Internet in the traditional manner; the Crawler's task is then to find similar documents on the Internet. The documents retrieved by the Crawler as relevant are passed to the expert, who makes the final decision on whether they are worth including in the database.

2. How does the Crawler work?

The Crawler is a program that recursively retrieves documents from the Internet and checks each retrieved document for relevance to the database topic. For each document a score is calculated according to the Crawler's filter, which describes the database topic: it contains terms with associated weights, together with a threshold, a positive number used to decide whether a document is relevant enough for the database. The relevance score is calculated from the document's term-frequency profile, and a document is considered relevant to the database if its score is greater than the threshold. From each retrieved document the Crawler extracts URLs (Uniform Resource Locators) and adds them to a queue of found URLs, which is sorted by the scores of the documents in which the URLs were found. The Crawler takes URLs from this queue to retrieve further documents.
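To make the scoring and queue ordering concrete, the following is a minimal sketch of the loop just described, not the OASIS implementation itself; the filter is assumed to be a term-to-weight mapping plus a numeric threshold, and fetch(), tokenize() and extract_urls() are hypothetical helpers.

    # Illustrative sketch of the Crawler's scoring and crawl loop (not OASIS code).
    # The filter is a {term: weight} mapping plus a threshold; fetch(), tokenize()
    # and extract_urls() are hypothetical helpers, not part of the OASIS code base.
    import heapq
    from collections import Counter

    def score(text, term_weights, tokenize):
        """Relevance score from the document's term-frequency profile."""
        freqs = Counter(tokenize(text))
        return sum(weight * freqs[term] for term, weight in term_weights.items())

    def crawl(seed_urls, term_weights, threshold, fetch, tokenize, extract_urls, max_docs):
        # Queue of found URLs, ordered by the score of the document they were found in
        # (heapq is a min-heap, so scores are negated).
        queue = [(0.0, url) for url in seed_urls]
        heapq.heapify(queue)
        seen, recommended = set(seed_urls), []
        evaluated = 0
        while queue and evaluated < max_docs:
            _, url = heapq.heappop(queue)
            text = fetch(url)
            if text is None:
                continue
            evaluated += 1
            s = score(text, term_weights, tokenize)
            if s > threshold:                 # relevant enough to show the expert
                recommended.append((url, s))
            for link in extract_urls(text, base=url):
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(queue, (-s, link))  # follow links from high-scoring pages first
        return recommended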

The OASIS Crawler has been implemented as part of the OASIS project [1,2] and is currently in the alpha stage.

3. Technology for constructing application-oriented databases

Below we describe the technology used in the experiments.

  1. The expert manually finds M documents relevant to the database topic. These documents form the core of the constructed database.
  2. A dictionary is built automatically from the database core. Stop words (the, or, then, and, etc.) are removed, and each word is converted to a term, i.e. the stem of the word, using Porter's stemming algorithm [3]. For each term a tf-weight is calculated [4], and the dictionary is sorted by tf-weight.
  3. The filter is formed from the first N terms.
  4. The threshold is determined by running the Crawler over the database core; the fitting process is iterative. The threshold is suitable if recall is K%. (Recall is one of the standard performance measures for information filtering systems: the fraction of the actually relevant documents that are selected.) The aim of the filter is to maximize the number of actually relevant documents selected for the database and to minimize the "garbage". (A sketch of steps 2-4 is given after this list.)
  5. The Crawler runs for H hours or until Z documents have been evaluated. The URLs of evaluated and recommended documents are stored, as is the queue of found URLs; this is important if the Crawler is restarted.
  6. Duplicates from "mirrors" are then deleted automatically and the documents are sorted according to the directories of their hosts, in order to reduce the manual analysis.
  7. The expert manually evaluates the documents recommended by the Crawler and selects those that are relevant to the database topic.
  8. The process can be repeated from step 5. (When the Crawler is restarted it uses all the stored information and never recommends documents that have already been recommended.)
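As a rough illustration of steps 2-4, the sketch below builds a filter from the core documents and fits the threshold; it is not the OASIS code. NLTK's PorterStemmer stands in for the Porter algorithm [3], the stop-word list is abridged, and the tf-weights are plain term frequencies summed over the core, one of the weighting schemes discussed in [4].

    # Sketch of steps 2-4: build the filter from the database core and fit the
    # threshold so that recall over the core is K%.  Assumptions: NLTK's
    # PorterStemmer as the stemmer [3], raw term frequencies as tf-weights [4],
    # and an abridged stop-word list.
    import re
    from collections import Counter
    from nltk.stem import PorterStemmer

    STOP_WORDS = {"the", "a", "an", "or", "and", "then", "of", "in", "to", "is"}  # abridged
    stemmer = PorterStemmer()

    def terms(text):
        """Step 2: lower-case words, drop stop words, reduce each word to its stem."""
        words = re.findall(r"[a-z]+", text.lower())
        return [stemmer.stem(w) for w in words if w not in STOP_WORDS]

    def build_filter(core_texts, n_terms=1000):
        """Steps 2-3: tf-weight dictionary over the core, truncated to the top N terms."""
        dictionary = Counter()
        for text in core_texts:
            dictionary.update(terms(text))
        return dict(dictionary.most_common(n_terms))

    def fit_threshold(core_texts, term_weights, recall=0.5):
        """Step 4: choose a threshold that keeps at least `recall` of the core documents."""
        scores = sorted(
            (sum(term_weights.get(t, 0) for t in terms(text)) for text in core_texts),
            reverse=True,
        )
        kept = max(1, int(round(recall * len(scores))))
        # Just below the kept-th highest core score, so those documents pass the
        # "score > threshold" test used by the Crawler.
        return scores[kept - 1] - 1e-9

With n_terms = 1000 and recall = 0.5 this corresponds to the parameter choices N = 1000 and K = 50% in Table 1.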
4. Experiments

To test the technology described above, it was necessary to construct test databases consisting of real Web documents with human evaluations of the pages' topics. Since no standard test collection, such as the TREC collections, exists for real Internet data, several categories in the Yahoo! directory tree were selected and the pages listed in those categories were used to form the test databases. The following categories were selected: Benchmarks, Museums, Research Groups, Travelogues, Information Retrieval, Card Games, Programming Languages, Unix-Linux. Documents listed in a given Yahoo! category are assumed to be relevant only to that category's topic.

Table 1 shows the selected values of the parameters mentioned above: the database core contains 100-150 documents, the filter is formed from 1000 terms, the threshold provides a recall of 50%, and the Crawler ran for 48-385 hours.

Table 1. Parameter values selection

    Parameter                    | Value
    -----------------------------+----------
    M (core documents)           | 100 - 150
    N (filter terms)             | 1000
    K (recall)                   | 50%
    H (crawl time, hours)        | 48 - 385
    Z (evaluated documents, max) | 100000

Table 2 shows the results of the collection-construction experiments. The Crawler ran as a background process on a server at Dublin University (hardware: Pentium II, 330 MHz; software: Linux). The speed of collection gathering varied widely: on average, 4 documents were recommended for every 100 evaluated documents. The number of duplicate documents (i.e. documents gathered from "mirrors") also varied.

The documents recommended for the Information Retrieval and Physics collections were analyzed manually in full. For the other collections, some documents from each directory were analyzed, as well as all documents that did not belong to any directory.

Manual expert analysis showed that, on average, approximately 61% of the recommended documents were actually junk (i.e. they did not belong to the given topic). In other words, roughly four out of every ten documents recommended by the Crawler were correct, so the precision was about 39% (precision is the fraction of the recommended documents that are actually relevant). The best result was shown by the Physics collection, where the "garbage" was only about 20%.
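As an illustration of how the precision figures in Table 2 can be read (assuming, as the Physics and Benchmarks rows suggest, that duplicates are excluded from the denominator):

    # Precision for the Physics row of Table 2: relevant / (recommended - duplicates).
    recommended, duplicates, relevant = 607, 33, 467
    precision = relevant / (recommended - duplicates)
    print(round(precision, 2))  # 0.81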

Table 3 shows some interesting results. The Crawler was started twice for each of two collections, Information Retrieval and Card Games: once with a filter of 1000 terms and once with a filter of 100 terms. The short filter was constructed manually by an expert, who selected 100 significant terms from the dictionary of the collection. Precision increased for both collections.

Table 2. Construction of collections

    Collection            | Core | Time, hours | Evaluated docs | Recommended docs | Duplicate docs | Manually analyzed (dir. + file) | Relevant docs   | Precision
    ----------------------+------+-------------+----------------+------------------+----------------+---------------------------------+-----------------+----------
    Travelogues           | 149  | 82          | ~8000          | 2883             | 62             | 553 + 338                       | 226 (48 + 137)  | 0.08
    Research Groups       | 100  | 88          | ~22500         | 3091             | 60             | 777 + 134                       | 811 (372 + 45)  | 0.29
    Museums               | 121  | 150         | ~75000         | 3631             | 444            | 1780 + 392                      | 444 (227 + 39)  | 0.14
    Benchmarks            | 110  | 44          | ~16000         | 1233             | 350            | 136 + 33                        | 649 (85 + 24)   | 0.73
    Programming Languages | 121  | 40          | 12500          | 1004             | 22             | 331 + 88                        | 445 (202 + 18)  | 0.45
    Information Retrieval | 110  | 385         | ~91000         | 594              | 14             | 580                             | 202             | 0.34
    Physics               | 50   | 240         | 100000         | 607              | 33             | 574                             | 467             | 0.81
    Card Games            | 110  | 10          | 10000          | 4170             | 114            | 134 + 129                       | 798 (550 + 78)  | 0.2

Table 3. The effect of reducing the filter length

    Collection            | Filter length | Time, hours | Evaluated docs | Recommended docs | Relevant docs | Precision
    ----------------------+---------------+-------------+----------------+------------------+---------------+----------
    Information Retrieval | 1000          | 385         | 91000          | 594              | 202           | 0.34
    Information Retrieval | 100           | 47          | < 30000        | 479              | 229           | 0.49
    Card Games            | 1000          | 10          | 10000          | 4150             | 798           | 0.2
    Card Games            | 100           | 10          | 10000          | 675              | 219           | 0.32

5. Conclusion

The OASIS Crawler can independently construct application-oriented databases of Internet documents on the basis of information filtering technology and heuristics. Preliminary results have shown that this technology has promising performance characteristics. The hypothesis about reducing the filter length has to be confirmed in further experimental tests.
References

  1. M. Bessonov, U. Heuser, I. Nekrestyanov, A. Patel. Open architecture for distributed search systems. In Proc. of the Sixth International Conference on Intelligence in Services and Networks, Apr. 1999.
  2. A.G. Babanin, M.B. Bessonov, V.V. Kluev, L.A. Petrosyan, A. Patel, W. Rosental. OASIS: a new-generation Internet information retrieval system. In Proc. of the All-Russian Conference "Internet and Modern Society", pp. 105-106, St. Petersburg, 1998 (in Russian).
  3. M.F. Porter. An Algorithm for Suffix Stripping. Program, 14(3), pp. 130-137, July 1980.
  4. G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Reading, MA: Addison-Wesley, 1989.
